Text File | 1996-08-31 | 21KB | 414 lines

HTMSTRIP.DOC                                        Revised: 08-31-96

The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding, and write the file out as something more useful.

Features of this program:

* Can be run across an entire subdirectory (for example, your entire cache subdirectory), and will only process the HTML documents that it finds. (There are some options on this.)
* Removes all embedded HTML commands.
* Recodes the standard HTML "entity references" (e.g. "&copy;" becomes "(c)"). The actual replacements are coded in a user-modifiable lookup file.
* Handles standard indent, heading, selection groups, menus, tables, etc.
* Reflows all text as appropriate.
* Optionally, will replace Link, Image, and Input references with user-definable text representations.
* Optionally, alerts you to possible errors in the HTML code itself.

HTML codes are enclosed in <...> indicators. For upward compatibility reasons, Web browsers ignore any codes that they don't understand and just process the ones they can handle. Note that the HTMSTRIP command is currently geared for handling HTML 2.0 files plus the Netscape table-specific extensions (added to HTML 3.0).

HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;" "entity references" (e.g. "&copy;" is replaced by "(c)"). You can add or change these replacements as desired by using the INI file (documented later).

HTMSTRIP is also tuned to handle several embedded HTML codes specially.
These codes are the following:

   <A ...>                         External link
   <BLOCKQUOTE>...</BLOCKQUOTE>    Indented block of text
   <BR>                            Forced line break
   <CAPTION>...</CAPTION>          Title for a table
   <CENTER>...</CENTER>            Centering text
   <DD>                            Term definition
   <DIR>...</DIR>                  Directory list of items
   </DL>                           End of definition list
   <DT>                            First term of definition list/glossary
   <H1> to <H6>...</H1> to </H6>   Heading items
   <HR>                            Horizontal rule
   <IMG ...>                       Image
   <INPUT ...>                     User input
   <LI>                            Menu/Ordered/Unordered/Directory list item
   <MENU>...</MENU>                Menu listing
   <OL>...</OL>                    Ordered listing
   <OPTION>                        Used for single/multiple choice menus
   <P>                             Paragraph indicator
   <PRE>...</PRE>                  Preserve spacing block (preformatted text)
   <SCRIPT>...</SCRIPT>            JavaScript blocks are ignored
   <SELECT>...</SELECT>            Block for single/multiple choice menu
   <TABLE>...</TABLE>              Table block
   <TD>...</TD>                    Table data (cell)
   <TH>...</TH>                    Table heading
   <TITLE>...</TITLE>              Title item
   <TR>...</TR>                    Table row
   <UL>...</UL>                    Unordered listing

If you run across other codes that become vital, let me know and I'll try to handle them somehow.

How to get HTML files:

Some people who are using regular Web browsers like Mosaic or Netscape don't realize that they're automatically saving HTML files to their hard disk throughout every Web session. That's because just about every Web browser saves the most recently accessed files from the Web (including HTML source code, GIFs, and JPGs) on your hard disk and reads them from there instead of requiring you to download them every time you go back to a previous page. You can typically set this under "Preferences" and "Cache" in your Web browser.

I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats downloading the same pages over again, even at 28.8K. And I make sure that I do not have anything specified like "clear cache at the end of every session". Then I just go through the files in the cache subdirectory afterward and reprocess them.
A cache has two disadvantages. The first is that it takes up hard disk space but, hey, the Web browser is typically running in Windows, so why are you surprised? The second is that if the page actually changes between sessions, you typically won't notice the new page as long as the old one remains in your cache. If you think a page in the cache should have changed but didn't, you can typically ask your Web browser to reload the page. On some browsers, this is shown as an arrow in the form of a circle.

HTMSTRIP can process the entire cache subdirectory. It automatically detects non-HTML files for you and processes accordingly. It creates new text file versions of just the HTML pages it finds.

By the way, the current beta version of Netscape typically ignores my cache setting. I don't have the slightest idea why. As a result, when you Alt-F4 out of Netscape, it goes through and deletes all but a few of the temporary files. This is annoying, to say the least, and it means I have to run HTMSTRIP from a DOS window just before leaving Netscape. If anyone knows why it does this to me, please let me know!

Specifying parameters:

Parameters for this program can be set in the following ways. The last setting encountered always wins:

- Read from an *.INI file (see BRUCEINI.DOC file),
- Through the use of an environment variable (SET HTMSTRIP=whatever), or
- From the command line (see "Syntax" below)

Defining entity references:

HTMSTRIP will process an entity reference definition file if one is found. This table can be in your standard *.INI file (e.g. HTMSTRIP.INI) if desired, or it can be a separate file specified using the /Linitfile parameter.

Entity references are how non-standard characters like the copyright character are handled in HTML pages. Entity references are indicated as "&xxx;" where "xxx" is either a code or a number preceded by a pound sign. The copyright symbol is indicated in HTML as "&copy;".
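As a rough illustration of the "&xxx;" form just described, the following Python sketch finds entity references in a string and replaces them from a small lookup table. It is not HTMSTRIP's actual code. The table contents, the function name replace_entities, and the choice to leave unknown references untouched are all assumptions made for the example.

```python
import re

# Hypothetical lookup table for illustration only; HTMSTRIP reads its
# real replacements from an INI file.
LOOKUP = {
    "copy": "(c)",   # named code
    "#169": "(c)",   # number preceded by a pound sign
}

# "&", then either "#" plus digits or an alphanumeric code, then ";".
ENTITY = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")


def replace_entities(text, lookup=LOOKUP):
    """Replace each "&xxx;" whose "xxx" is a key in lookup."""
    def substitute(match):
        name = match.group(1)
        # Assumption: a reference with no lookup entry is kept as written.
        return lookup.get(name, match.group(0))
    return ENTITY.sub(substitute, text)


if __name__ == "__main__":
    print(replace_entities("Copyright &copy; 1996"))
```

A bare ampersand, as in "AT&T", has no trailing ";" and so is not treated as an entity reference by this pattern.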
A default HTMSTRIP.INI is provided with over 230 entity reference lookups. To define or change these lookups, the INI file should include a series of lines in the following format:

   &xxx; = outstr

where "&xxx;" is the HTML sequence and "outstr" is what you want to replace it with. The "outstr" portion can consist of regular non-space ASCII text characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or decimal values (in the form \nnn). (See the BRUCEHEX.DOC file.) It can also be the word "NULL", which translates the string into nothing. You cannot use a space or equal sign in "outstr"; use the hexadecimal or decimal representations instead.

The table does not have to be in any specified order. Lines can end with "/*" followed by a comment if you want.

Examples:

   &copy; = (c)      /* Copyright symbol
   &deg; = °
   &eacute; = é
   &ecirc; = ê
   &egrave; = è
   &nbsp; = \032

Remember that "&xxx;" entity references (yes, I hate that phrase) are case-sensitive in HTML. "&deg;" will not find "&Deg;".

There seems to be a trend of late to relax some of the replacement coding requirements in Web pages. The ";" is now, apparently, becoming optional. Numeric replacements (e.g. " ") seem to no longer require the leading pound sign. Therefore, HTMSTRIP looks for both of these variations for any appropriate lookup. "&copy;" will find "&copy" and "&#153;" will find "&153". The lookup itself has to be entered in the formally correct way thoug